This quick task shows what a quick data analysis with Python can look like. The data is anonymized and concerns taxi-like services.
There are two data sets
data_orders - contains data on user orders, with columns order_datetime, origin_longitude, origin_latitude, m_order_eta, order_gk, order_status_key, is_driver_assigned_key and cancellations_time_in_seconds.
data_offers - contains pairs of order number - ID of the driver to whom the order was offered.
At the moment when the client clicks on the "Order" button in the application, the matching system looks for the most relevant drivers and offers them an order. The task is to investigate some matching metrics for orders that did not complete successfully (the client did not receive the car in the end).
import pandas as pd
import numpy as np
import plotly.express as px
from IPython.core.display import HTML
0. Data preparation
df_offers = pd.read_csv("data_offers.csv")
df_orders = pd.read_csv("data_orders.csv")
df_orders.dtypes
order_datetime                    object
origin_longitude                 float64
origin_latitude                  float64
m_order_eta                      float64
order_gk                           int64
order_status_key                   int64
is_driver_assigned_key             int64
cancellations_time_in_seconds    float64
dtype: object
df_orders['order_datetime_DT'] = pd.to_datetime(df_orders['order_datetime'])  # vectorized, no need for apply
df_orders['order_daytime'] = df_orders['order_datetime_DT'].dt.hour
Distribution of orders by reason of failure: cancellations before and after the assignment of a driver, and system rejects
df_orders['order_status_key'].value_counts().plot(kind='bar')
Conclusions
The largest number of orders is cancelled by the client (over twice as many as are cancelled by the system).
Graph of the distribution of failures by hour
fig = px.histogram(df_orders, x="order_daytime", color="order_status_key", marginal="violin")
fig.show()
The hours with the most failures are in the morning between 7 and 9 am, with a peak at 8 am, and at night between 9 pm and 3 am.
Conclusions
One explanation is that these are busy hours, so statistically they see the most failures. To confirm this thesis, a comparison against the total order volume is required. The same thesis should be checked for the failures between 9 pm and 3 am.
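Since the provided datasets only cover failed orders, the total order volume per hour is not available here; assuming a hypothetical totals series were obtained, the comparison could be sketched like this on toy numbers:

```python
import pandas as pd

# Toy data: failed orders per hour, and a hypothetical total-orders count
# per hour (NOT in the provided datasets -- an assumption for illustration).
fails_by_hour = pd.Series({7: 120, 8: 200, 21: 90}, name="fails")
totals_by_hour = pd.Series({7: 800, 8: 1000, 21: 300}, name="totals")

# Failure rate per hour: a large absolute fail count at 8 am may still be
# a lower *rate* than a smaller count during a quiet hour.
fail_rate = fails_by_hour / totals_by_hour
print(fail_rate)
```

On these toy numbers the 8 am peak (200 fails) has a lower failure rate (0.2) than the quiet 9 pm hour (0.3), which is exactly the distinction the thesis above needs.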
Plot the average time to cancellation (cancellations_time_in_seconds), with and without a driver, by hour. If there are outliers in the data, it is better to remove them
df_cancellation_not_blank = df_orders.dropna(subset=['cancellations_time_in_seconds'])
df_cancellation_not_blank.describe()
| | origin_longitude | origin_latitude | m_order_eta | order_gk | order_status_key | is_driver_assigned_key | cancellations_time_in_seconds | order_daytime |
|---|---|---|---|---|---|---|---|---|
| count | 7307.000000 | 7307.000000 | 2811.000000 | 7.307000e+03 | 7307.0 | 7307.000000 | 7307.000000 | 7307.000000 |
| mean | -0.964020 | 51.450557 | 441.610459 | 3.000597e+12 | 4.0 | 0.384700 | 157.892021 | 12.646093 |
| std | 0.021619 | 0.011558 | 288.057123 | 2.436466e+07 | 0.0 | 0.486558 | 213.366963 | 7.425299 |
| min | -1.066952 | 51.399523 | 60.000000 | 3.000550e+12 | 4.0 | 0.000000 | 3.000000 | 0.000000 |
| 25% | -0.973939 | 51.444823 | 233.000000 | 3.000583e+12 | 4.0 | 0.000000 | 45.000000 | 7.000000 |
| 50% | -0.966654 | 51.452237 | 372.000000 | 3.000594e+12 | 4.0 | 0.000000 | 98.000000 | 13.000000 |
| 75% | -0.949864 | 51.456704 | 653.500000 | 3.000623e+12 | 4.0 | 1.000000 | 187.500000 | 20.000000 |
| max | -0.867088 | 51.496169 | 1559.000000 | 3.000633e+12 | 4.0 | 1.000000 | 4303.000000 | 23.000000 |
There are outliers, so extreme observations will be trimmed. After some experimentation with the data, the lower and upper 5th percentiles will be cut off.
# 2. Getting rid of outliers
q_low = df_cancellation_not_blank["cancellations_time_in_seconds"].quantile(0.05)
q_hi = df_cancellation_not_blank["cancellations_time_in_seconds"].quantile(0.95)
df_filtered = df_cancellation_not_blank[(df_cancellation_not_blank["cancellations_time_in_seconds"] < q_hi) & (df_cancellation_not_blank["cancellations_time_in_seconds"] > q_low)]
# keep the 5th-95th percentile range
df_filtered.describe()
| | origin_longitude | origin_latitude | m_order_eta | order_gk | order_status_key | is_driver_assigned_key | cancellations_time_in_seconds | order_daytime |
|---|---|---|---|---|---|---|---|---|
| count | 6534.000000 | 6534.000000 | 2469.000000 | 6.534000e+03 | 6534.0 | 6534.000000 | 6534.000000 | 6534.000000 |
| mean | -0.964093 | 51.450704 | 444.827055 | 3.000597e+12 | 4.0 | 0.377870 | 127.886134 | 12.642485 |
| std | 0.021469 | 0.011529 | 292.153826 | 2.432113e+07 | 0.0 | 0.484892 | 104.830568 | 7.445468 |
| min | -1.066952 | 51.399523 | 60.000000 | 3.000550e+12 | 4.0 | 0.000000 | 12.000000 | 0.000000 |
| 25% | -0.973802 | 51.444953 | 228.000000 | 3.000583e+12 | 4.0 | 0.000000 | 50.000000 | 7.000000 |
| 50% | -0.966762 | 51.452512 | 411.000000 | 3.000594e+12 | 4.0 | 0.000000 | 99.000000 | 13.000000 |
| 75% | -0.950111 | 51.456726 | 657.000000 | 3.000623e+12 | 4.0 | 1.000000 | 178.000000 | 20.000000 |
| max | -0.867088 | 51.495445 | 1559.000000 | 3.000633e+12 | 4.0 | 1.000000 | 543.000000 | 23.000000 |
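Percentile trimming always discards 10% of the observations regardless of the distribution's shape. A common alternative worth considering is an IQR-based fence (Tukey's rule), which only removes points that are actually extreme; a sketch on toy cancellation times:

```python
import pandas as pd

# Toy cancellation times in seconds, with one extreme value (4300s).
times = pd.Series([30, 45, 60, 90, 120, 150, 200, 4300])

# Tukey fences: points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are outliers.
q1, q3 = times.quantile(0.25), times.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Keep only observations inside the fences.
trimmed = times[(times >= lower) & (times <= upper)]
print(trimmed.max())
```

Here only the extreme 4300s point is dropped, while all ordinary observations survive.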
df_filtered_driver_1 = df_filtered[df_filtered['is_driver_assigned_key'] == 1]
df_filtered_driver_0 = df_filtered[df_filtered['is_driver_assigned_key'] == 0]
def filteredDriver(df, group_by_list, agg_what, agg_function_list, new_cols_names_list):
    # Group df by the given keys, aggregate one column with the given
    # functions, and rename the resulting columns.
    df_g = df.groupby(group_by_list).agg({agg_what: agg_function_list})
    df_g.columns = new_cols_names_list
    df_g = df_g.reset_index()
    return df_g
df_filtered_driver_1_g = filteredDriver(df_filtered_driver_1
, ['order_daytime']
, 'cancellations_time_in_seconds'
, ['mean']
, ['cancellations_MEAN_d1'])
df_filtered_driver_0_g = filteredDriver(df_filtered_driver_0
, ['order_daytime']
, 'cancellations_time_in_seconds'
, ['mean']
, ['cancellations_MEAN_d0'])
df_drivers_0_1_g = pd.merge(df_filtered_driver_1_g, df_filtered_driver_0_g, how="inner", on = 'order_daytime')
df_drivers_0_1_g[['cancellations_MEAN_d1', 'cancellations_MEAN_d0']].plot(color = ['#8B008B', '#00CED1'], legend = 1)
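The split/aggregate/merge above can also be done in a single pass with a two-key groupby followed by unstack; a sketch on toy data (column names mirror the notebook's):

```python
import pandas as pd

# Toy rows: hour of the order, driver-assigned flag, time to cancellation.
df = pd.DataFrame({
    "order_daytime": [0, 0, 1, 1, 1],
    "is_driver_assigned_key": [0, 1, 0, 0, 1],
    "cancellations_time_in_seconds": [40.0, 100.0, 30.0, 50.0, 120.0],
})

# One groupby on both keys, then pivot the flag into columns:
# column 0 = no driver assigned, column 1 = driver assigned.
means = (df.groupby(["order_daytime", "is_driver_assigned_key"])
           ["cancellations_time_in_seconds"].mean().unstack())
print(means)
```

This yields one row per hour with both means side by side, ready for plotting, without the intermediate merge.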
Conclusions
On average, the cancellation time is longer when a driver is assigned. Overall cancellation time also increases during low-activity hours (from midnight to early morning). To draw more detailed conclusions, the above statistics should be broken down by cancellation type as well.
Distribution of the mean ETA over the hours
df_eta_not_blank = df_orders.dropna(subset=['m_order_eta'])
df_eta_not_blank_mean = df_eta_not_blank.groupby('order_daytime')['m_order_eta'].mean()
df_eta_not_blank_mean.plot(kind='bar', color = '#20B2AA')
Conclusions
Average estimated time of arrival gets longer during busy hours (in the morning and in the afternoon).
This can also be explained by traffic - during busy hours it is harder to arrive on time.
Hourly breakdown of the average number of drivers who were offered an order
df_joined = pd.merge(df_orders, df_offers, how="inner", on = 'order_gk')
df_joined['order_datetime_DT'] = pd.to_datetime(df_joined['order_datetime'])  # vectorized, no need for apply
df_joined['order_daytime'] = df_joined['order_datetime_DT'].dt.hour
df_joined.head(3)
| | order_datetime | origin_longitude | origin_latitude | m_order_eta | order_gk | order_status_key | is_driver_assigned_key | cancellations_time_in_seconds | order_datetime_DT | order_daytime | offered_driver_id |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 18:08:07 | -0.978916 | 51.456173 | 60.0 | 3000583041974 | 4 | 1 | 198.0 | 2023-02-06 18:08:07 | 18 | 300020372 |
| 1 | 20:57:32 | -0.950385 | 51.456843 | NaN | 3000583116437 | 4 | 0 | 128.0 | 2023-02-06 20:57:32 | 20 | 300022635 |
| 2 | 20:57:32 | -0.950385 | 51.456843 | NaN | 3000583116437 | 4 | 0 | 128.0 | 2023-02-06 20:57:32 | 20 | 300014328 |
df_count_drivers = df_joined.groupby(['order_daytime', 'order_gk']).agg({'offered_driver_id': ['count']})
df_count_drivers.columns = ['drv_count']
df_count_drivers = df_count_drivers.reset_index()
df_count_drivers
df_drivers_per_hour_mean = df_count_drivers.groupby('order_daytime').agg({'drv_count': ['mean', 'median']})
df_drivers_per_hour_mean.plot(kind='bar', color = ['#20B2AA', '#FFE4C4'], legend = 0)
df_drivers_per_hour_mean.describe()
| | drv_count (mean) | drv_count (median) |
|---|---|---|
| count | 24.000000 | 24.000000 |
| mean | 3.867662 | 3.458333 |
| std | 0.581858 | 0.721060 |
| min | 2.543478 | 2.000000 |
| 25% | 3.587739 | 3.000000 |
| 50% | 3.873392 | 3.500000 |
| 75% | 4.285998 | 4.000000 |
| max | 4.623952 | 5.000000 |
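The two-step aggregation above (count offers per order, then average per hour) can be sketched on toy data:

```python
import pandas as pd

# Toy joined offers: each row is one driver being offered one order.
df = pd.DataFrame({
    "order_daytime": [8, 8, 8, 9, 9],
    "order_gk":      [1, 1, 2, 3, 3],
    "offered_driver_id": [11, 12, 13, 14, 15],
})

# Step 1: how many drivers were offered each order.
per_order = df.groupby(["order_daytime", "order_gk"])["offered_driver_id"].count()

# Step 2: average (and median) offers per order, by hour.
per_hour = per_order.groupby("order_daytime").agg(["mean", "median"])
print(per_hour)
```

At hour 8, orders 1 and 2 received 2 and 1 offers respectively, so the hourly mean is 1.5; collapsing to one row per order first is what prevents busy orders from being double-counted.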
Conclusions
I would not say there is a shortage or an excess of drivers. Looking at the mean, yes, fluctuations occur. Keep in mind that between 3 and 6 am traffic should be lower, so the driver buffer does not need to be kept around the average. As for an excess of drivers, there is potential for optimization between 6 pm and 3 am, when there is little traffic and ETA is around the average, yet the number of available drivers is high. But this should be confirmed by a general-purpose analysis.
Hexagons
# imports for this one
import h3
from folium import Map, Marker, GeoJson
import folium
import branca.colormap as cm
from geojson.feature import *
import json
# it took me an hour to get the h3 package working... so take your time
h3_df = df_orders
h3_df = h3_df[['order_gk', 'origin_latitude', 'origin_longitude']]
h3_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10716 entries, 0 to 10715
Data columns (total 3 columns):
 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   order_gk          10716 non-null  int64
 1   origin_latitude   10716 non-null  float64
 2   origin_longitude  10716 non-null  float64
dtypes: float64(2), int64(1)
memory usage: 251.3 KB
# THIS WHOLE PART IS COPIED FROM https://github.com/ak4728/h3-nyc-taxi and changed a bit
def counts_by_hexagon(df, resolution):
"""
Use h3.geo_to_h3 to index each data point into the spatial index of the specified resolution.
Use h3.h3_to_geo_boundary to obtain the geometries of these hexagons
Ex counts_by_hexagon(data, 9)
"""
    df = df[["lat","lng"]].copy()  # .copy() avoids SettingWithCopyWarning on the next line
    df["hex_id"] = df.apply(lambda row: h3.geo_to_h3(row["lat"], row["lng"], resolution), axis = 1)
df_aggreg = df.groupby(by = "hex_id").size().reset_index()
df_aggreg.columns = ["hex_id", "value"]
df_aggreg['geometry'] = df_aggreg.hex_id.apply(lambda x: { 'type' : 'Polygon',
'coordinates': [h3.h3_to_geo_boundary(h=x,geo_json=True)]
})
return df_aggreg
def hexagons_dataframe_to_geojson(df_hex, file_output = None):
"""
Produce the GeoJSON for a dataframe that has a geometry column in geojson
format already, along with the columns hex_id and value
    Ex hexagons_dataframe_to_geojson(df_hex)
"""
list_features = []
for i,row in df_hex.iterrows():
feature = Feature(geometry = row["geometry"] , id=row["hex_id"], properties = {"value" : row["value"]})
list_features.append(feature)
feat_collection = FeatureCollection(list_features)
geojson_result = json.dumps(feat_collection)
#optionally write to file
if file_output is not None:
with open(file_output,"w") as f:
json.dump(feat_collection,f)
return geojson_result
def choropleth_map(df_aggreg, border_color = 'black', fill_opacity = 0.7, initial_map = None, with_legend = False,
kind = "linear"):
"""
Creates choropleth maps given the aggregated data.
"""
#colormap
min_value = df_aggreg["value"].min()
max_value = df_aggreg["value"].max()
    m = round((min_value + max_value) / 2, 0)
#take resolution from the first row
res = h3.h3_get_resolution(df_aggreg.loc[0,'hex_id'])
if initial_map is None:
initial_map = Map(location= [51.456843, -0.950385], zoom_start=12, tiles="cartodbpositron",
attr= '© <a href="http://www.openstreetmap.org/copyright">OpenStreetMap</a> contributors © <a href="http://cartodb.com/attributions#basemaps">CartoDB</a>'
)
#the colormap
#color names accepted https://github.com/python-visualization/branca/blob/master/branca/_cnames.json
if kind == "linear":
custom_cm = cm.LinearColormap(['green','yellow','red'], vmin=min_value, vmax=max_value)
elif kind == "outlier":
        #for outliers, values would be -1, 0, 1
custom_cm = cm.LinearColormap(['blue','white','red'], vmin=min_value, vmax=max_value)
elif kind == "filled_nulls":
custom_cm = cm.LinearColormap(['sienna','green','yellow','red'],
index=[0,min_value,m,max_value],vmin=min_value,vmax=max_value)
#create geojson data from dataframe
geojson_data = hexagons_dataframe_to_geojson(df_hex = df_aggreg)
#plot on map
name_layer = "Choropleth " + str(res)
if kind != "linear":
name_layer = name_layer + kind
GeoJson(
geojson_data,
style_function=lambda feature: {
'fillColor': custom_cm(feature['properties']['value']),
'color': border_color,
'weight': 1,
'fillOpacity': fill_opacity
},
name = name_layer
).add_to(initial_map)
#add legend (not recommended if multiple layers)
if with_legend == True:
custom_cm.add_to(initial_map)
return initial_map
def plot_scatter(df, metric_col, x='lng', y='lat', marker='.', alpha=1, figsize=(16,12), colormap='viridis'):
    """
    Scatter plot function for h3-indexed objects
    """
    import matplotlib.pyplot as plt  # needed below; not in the imports cell above
    df.plot.scatter(x=x, y=y, c=metric_col, title=metric_col,
                    edgecolors='none', colormap=colormap, marker=marker, alpha=alpha, figsize=figsize)
    plt.xticks([], []); plt.yticks([], [])
h3_df.columns = ['order_gk','lat','lng']  # rename coordinates for counts_by_hexagon
df_aggreg = counts_by_hexagon(df = h3_df, resolution = 8)
df_aggreg.sort_values(by = "value", ascending = False, inplace = True)
hexmap = choropleth_map(df_aggreg = df_aggreg, with_legend = True)
folium.map.LayerControl('bottomright', collapsed=False).add_to(hexmap)
hexmap
cumsum = (df_aggreg["value"]/df_aggreg["value"].sum()).sort_values(ascending = False).cumsum()
df_aggreg["is80perc"] = pd.Series(0, index=cumsum.index).where(cumsum >= 0.8, 1)
hex_80perc = df_aggreg["is80perc"].sum()
print('Number of hexes that contain 80% of failed orders:', hex_80perc)
df_aggreg.head(hex_80perc+1)
Number of hexes that contain 80% of failed orders: 23
| | hex_id | value | geometry | is80perc |
|---|---|---|---|---|
| 97 | 88195d2b1dfffff | 1497 | {'type': 'Polygon', 'coordinates': [((-0.96741... | 1 |
| 96 | 88195d2b1bfffff | 870 | {'type': 'Polygon', 'coordinates': [((-0.95072... | 1 |
| 93 | 88195d2b15fffff | 774 | {'type': 'Polygon', 'coordinates': [((-0.97531... | 1 |
| 91 | 88195d2b11fffff | 707 | {'type': 'Polygon', 'coordinates': [((-0.96301... | 1 |
| 95 | 88195d2b19fffff | 667 | {'type': 'Polygon', 'coordinates': [((-0.95512... | 1 |
| 20 | 88195d284dfffff | 653 | {'type': 'Polygon', 'coordinates': [((-0.94632... | 1 |
| 63 | 88195d2a27fffff | 414 | {'type': 'Polygon', 'coordinates': [((-0.93842... | 1 |
| 89 | 88195d2b0bfffff | 372 | {'type': 'Polygon', 'coordinates': [((-0.97182... | 1 |
| 62 | 88195d2a25fffff | 362 | {'type': 'Polygon', 'coordinates': [((-0.94282... | 1 |
| 92 | 88195d2b13fffff | 346 | {'type': 'Polygon', 'coordinates': [((-0.95862... | 1 |
| 85 | 88195d2b03fffff | 257 | {'type': 'Polygon', 'coordinates': [((-0.97972... | 1 |
| 94 | 88195d2b17fffff | 210 | {'type': 'Polygon', 'coordinates': [((-0.97091... | 1 |
| 108 | 88195d2b39fffff | 184 | {'type': 'Polygon', 'coordinates': [((-0.98761... | 1 |
| 23 | 88195d2861fffff | 182 | {'type': 'Polygon', 'coordinates': [((-0.97441... | 1 |
| 60 | 88195d2a21fffff | 156 | {'type': 'Polygon', 'coordinates': [((-0.93053... | 1 |
| 110 | 88195d2b3dfffff | 153 | {'type': 'Polygon', 'coordinates': [((-0.99992... | 1 |
| 104 | 88195d2b31fffff | 143 | {'type': 'Polygon', 'coordinates': [((-0.99551... | 1 |
| 27 | 88195d2869fffff | 125 | {'type': 'Polygon', 'coordinates': [((-0.96651... | 1 |
| 109 | 88195d2b3bfffff | 115 | {'type': 'Polygon', 'coordinates': [((-0.98321... | 1 |
| 117 | 88195d2b51fffff | 98 | {'type': 'Polygon', 'coordinates': [((-0.95162... | 1 |
| 120 | 88195d2b57fffff | 92 | {'type': 'Polygon', 'coordinates': [((-0.95952... | 1 |
| 61 | 88195d2a23fffff | 91 | {'type': 'Polygon', 'coordinates': [((-0.92613... | 1 |
| 119 | 88195d2b55fffff | 85 | {'type': 'Polygon', 'coordinates': [((-0.96392... | 1 |
| 88 | 88195d2b09fffff | 81 | {'type': 'Polygon', 'coordinates': [((-0.97622... | 0 |
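The is80perc flag computed above (hexes whose cumulative share of failures is still below 80%, plus the first one that crosses the threshold) can be sketched on toy counts:

```python
import pandas as pd

# Toy hex counts, already sorted descending by value.
values = pd.Series([500, 250, 150, 60, 40], name="value")

# Cumulative share of all failures covered as we add hexes top-down.
share = values / values.sum()
cum = share.cumsum()

# Hexes needed to reach 80% of failures: those strictly below the
# threshold, plus the first one that crosses it.
needed = (cum < 0.8).sum() + 1
print(needed)
```

With these toy counts the top two hexes cover 75% of failures and the third pushes the total to 90%, so three hexes are needed.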
df_aggreg_top1 = df_aggreg.head(1).reset_index()
hexmap = choropleth_map(df_aggreg = df_aggreg_top1, with_legend = True)
hexmap
The hex with the largest number of fails is shown on the map above.
HTML("""
<style>
h1 {
text-align:center;
color:#008080;
}
h2 {
text-align:center;
color:#008080;
}
div.inner_cell {
background-color: #F0F8FF;
}
</style>
""")